Abstract

We consider the problem of classifying doc- 
uments not by topic, but by overall senti- 
ment, e.g., determining whether a review 
is positive or negative. Using movie re- 
views as data, we find that standard ma- 
chine learning techniques definitively out- 
perform human-produced baselines. How- 
ever, the three machine learning methods 
we employed (Naive Bayes, maximum en- 
tropy classification, and support vector ma- 
chines) do not perform as well on sentiment 
classification as on traditional topic-based 
categorization. We conclude by examining 
factors that make the sentiment classifica- 
tion problem more challenging. 

1 Introduction 

Today, very large amounts of information are avail- 
able in on-line documents. As part of the effort to 
better organize this information for users, researchers 
have been actively investigating the problem of au- 
tomatic text categorization. 

The bulk of such work has focused on topical cat- 
egorization, attempting to sort documents accord- 
ing to their subject matter (e.g., sports vs. poli- 
tics). However, recent years have seen rapid growth 
in on-line discussion groups and review sites (e.g., 
the New York Times' Books web page) where a cru- 
cial characteristic of the posted articles is their senti- 
ment, or overall opinion towards the subject matter 
- for example, whether a product review is pos- 
itive or negative. Labeling these articles with their 
sentiment would provide succinct summaries to read- 
ers; indeed, these labels are part of the appeal and 
value-add of such sites as www.rottentomatoes.com, 
which both labels movie reviews that do not con- 
tain explicit rating indicators and normalizes the 
different rating schemes that individual reviewers
use. Sentiment classification would also be helpful in
business intelligence applications (e.g., MindfulEye's
Lexant system^1) and recommender systems (e.g.,
Terveen et al. (1997), Tatemura (2000)), where user
input and feedback could be quickly summarized;
indeed, in general, free-form survey responses given in
natural language format could be processed using
sentiment categorization. Moreover, there are also
potential applications to message filtering; for
example, one might be able to use sentiment information
to recognize and discard "flames" (Spertus, 1997).

In this paper, we examine the effectiveness of ap- 
plying machine learning techniques to the sentiment 
classification problem. A challenging aspect of this 
problem that seems to distinguish it from traditional 
topic-based classification is that while topics are of- 
ten identifiable by keywords alone, sentiment can be 
expressed in a more subtle manner. For example, the 
sentence "How could anyone sit through this movie?" 
contains no single word that is obviously negative. 
(See Section 7 for more examples.) Thus, sentiment
seems to require more understanding than the usual 
topic-based classification. So, apart from presenting 
our results obtained via machine learning techniques, 
we also analyze the problem to gain a better under- 
standing of how difficult it is. 

2 Previous Work 

This section briefly surveys previous work on non- 
topic-based text categorization. 

One area of research concentrates on classifying
documents according to their source or source style,
with statistically-detected stylistic variation (Biber,
1988) serving as an important cue. Examples include
author, publisher (e.g., the New York Times vs.
The Daily News), native-language background, and
"brow" (e.g., high-brow vs. "popular", or low-brow)
(Mosteller and Wallace, 1984; Argamon-Engelson et
al., 1998; Tomokiyo and Jones, 2001; Kessler et al.,
1997).

1 http://www.mindfuleye.com/about/lexant.htm




Another, more related area of research is that of 
determining the genre of texts; subjective genres, 
such as "editorial", are often one of the possible
categories (Karlgren and Cutting, 1994; Kessler et
al., 1997; Finn et al., 2002). Other work explicitly
attempts to find features indicating that subjective
language is being used (Hatzivassiloglou and Wiebe,
2000; Wiebe et al., 2001). But, while techniques for
genre categorization and subjectivity detection can
help us recognize documents that express an opin- 
ion, they do not address our specific classification 
task of determining what that opinion actually is. 

Most previous research on sentiment-based classi- 
fication has been at least partially knowledge-based. 
Some of this work focuses on classifying the semantic
orientation of individual words or phrases, using
linguistic heuristics or a pre-selected set of seed words
(Hatzivassiloglou and McKeown, 1997; Turney and
Littman, 2002). Past work on sentiment-based
categorization of entire documents has often involved
either the use of models inspired by cognitive
linguistics (Hearst, 1992; Sack, 1994) or the manual or
semi-manual construction of discriminant-word
lexicons (Huettner and Subasic, 2000; Das and Chen,
2001; Tong, 2001). Interestingly, our baseline
experiments, described in Section 4, show that humans
may not always have the best intuition for choosing
discriminating words.

Turney's (2002) work on classification of reviews
is perhaps the closest to ours.^2 He applied a
specific unsupervised learning technique based on the
mutual information between document phrases and 
the words "excellent" and "poor", where the mu- 
tual information is computed using statistics gath- 
ered by a search engine. In contrast, we utilize sev- 
eral completely prior-knowledge-free supervised ma- 
chine learning methods, with the goal of understand- 
ing the inherent difficulty of the task. 

3 The Movie-Review Domain 

For our experiments, we chose to work with movie 
reviews. This domain is experimentally convenient 
because there are large on-line collections of such re- 
views, and because reviewers often summarize their 
overall sentiment with a machine-extractable rat- 
ing indicator, such as a number of stars; hence, we 
did not need to hand-label the data for supervised 
learning or evaluation purposes. We also note that 
Turney (2002) found movie reviews to be the most
difficult of several domains for sentiment
classification, reporting an accuracy of 65.83% on a
120-document set (random-choice performance: 50%).
But we stress that the machine learning methods and 
features we use are not specific to movie reviews, and 
should be easily applicable to other domains as long 
as sufficient training data exists. 

Our data source was the Internet Movie Database 
(IMDb) archive of the rec.arts.movies.reviews
newsgroup.^3 We selected only reviews where the
author rating was expressed either with stars or some
numerical value (other conventions varied too widely 
to allow for automatic processing). Ratings were 
automatically extracted and converted into one of 
three categories: positive, negative, or neutral. For 
the work described in this paper, we concentrated 
only on discriminating between positive and negative 
sentiment. To avoid domination of the corpus by 
a small number of prolific reviewers, we imposed a 
limit of fewer than 20 reviews per author per senti- 
ment category, yielding a corpus of 752 negative and 
1301 positive reviews, with a total of 144 reviewers 
represented. This dataset will be available on-line at 



http://www.cs.cornell.edu/people/pabo/movie-review-data/
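
The per-author limit described above can be illustrated with a short Python sketch. It is not the procedure actually used to build the corpus: the tuple layout, the function name `cap_reviews_per_author`, and the choice to keep the first reviews encountered are all assumptions made for illustration.

```python
from collections import defaultdict


def cap_reviews_per_author(reviews, max_per_author=19):
    """Keep at most `max_per_author` reviews per (author, sentiment) pair.

    `reviews` is an iterable of (author, sentiment, text) tuples. A cap of
    19 corresponds to "fewer than 20". Which reviews survive once the cap
    is reached is an assumption here: the first ones encountered are kept.
    """
    kept = []
    seen = defaultdict(int)  # (author, sentiment) -> number kept so far
    for author, sentiment, text in reviews:
        key = (author, sentiment)
        if seen[key] < max_per_author:
            seen[key] += 1
            kept.append((author, sentiment, text))
    return kept
```

The cap is applied separately to each sentiment category, so a prolific author can still contribute up to 19 positive and 19 negative reviews.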



2 Indeed, although our choice of title was completely 
independent of his, our selections were eerily similar. 



4 A Closer Look At the Problem 

Intuitions seem to differ as to the difficulty of the sen- 
timent detection problem. An expert on using ma- 
chine learning for text categorization predicted rela- 
tively low performance for automatic methods. On 
the other hand, it seems that distinguishing positive 
from negative reviews is relatively easy for humans, 
especially in comparison to the standard text catego- 
rization problem, where topics can be closely related. 
One might also suspect that there are certain words 
people tend to use to express strong sentiments, so 
that it might suffice to simply produce a list of such 
words by introspection and rely on them alone to 
classify the texts. 

To test this latter hypothesis, we asked two gradu- 
ate students in computer science to (independently) 
choose good indicator words for positive and nega- 
tive sentiments in movie reviews. Their selections, 
shown in Figure 1, seem intuitively plausible. We
then converted their responses into simple decision
procedures that essentially count the number of the
proposed positive and negative words in a given
document. We applied these procedures to
uniformly-distributed data, so that the random-choice
baseline result would be 50%. As shown in Figure 1, the
accuracies (percentage of documents classified
correctly) for the human-based classifiers were 58%



3 http://reviews.imdb.com/Reviews/





          Proposed word lists                                               Accuracy  Ties
Human 1   positive: dazzling, brilliant, phenomenal, excellent, fantastic   58%       75%
          negative: suck, terrible, awful, unwatchable, hideous
Human 2   positive: gripping, mesmerizing, riveting, spectacular, cool,     64%       39%
          awesome, thrilling, badass, excellent, moving, exciting
          negative: bad, cliched, sucks, boring, stupid, slow

Figure 1: Baseline results for human word lists. Data: 700 positive and 700 negative reviews.





                  Proposed word lists                                                Accuracy  Ties
Human 3 + stats   positive: love, wonderful, best, great, superb, still, beautiful   69%       16%
                  negative: bad, worst, stupid, waste, boring, ?, !

Figure 2: Results for baseline using introspection and simple statistics of the data (including test data).



and 64%, respectively.^4 Note that the tie rates
(percentage of documents where the two sentiments
were rated equally likely) are quite high^5 (we chose
a tie-breaking policy that maximized the accuracy of
the baselines).
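
A word-list decision procedure of this kind can be sketched in a few lines of Python. This is an illustrative reading of the description, not the code behind Figure 1: the function names are hypothetical, and the `tie_label` argument is only a stand-in for the tie-breaking policy, which is not spelled out in the text.

```python
def classify_by_word_lists(tokens, positive_words, negative_words):
    """Return 'pos', 'neg', or 'tie' by counting list words in `tokens`."""
    pos = sum(1 for t in tokens if t in positive_words)
    neg = sum(1 for t in tokens if t in negative_words)
    if pos > neg:
        return "pos"
    if neg > pos:
        return "neg"
    return "tie"  # includes the 0-0 case where no list word occurs


def evaluate(documents, positive_words, negative_words, tie_label="pos"):
    """Return (accuracy, tie_rate) over (tokens, gold_label) pairs.

    Ties are resolved to `tie_label` (an assumed, fixed policy).
    """
    correct = ties = 0
    for tokens, gold in documents:
        guess = classify_by_word_lists(tokens, positive_words, negative_words)
        if guess == "tie":
            ties += 1
            guess = tie_label
        correct += guess == gold
    n = len(documents)
    return correct / n, ties / n
```

With short lists, many documents contain none of the listed words, which is one way a high tie rate can arise.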

While the tie rates suggest that the brevity of 
the human-produced lists is a factor in the relatively 
poor performance results, it is not the case that size 
alone necessarily limits accuracy. Based on a very 
preliminary examination of frequency counts in the 
entire corpus (including test data) plus introspection, 
we created a list of seven positive and seven negative 
words (including punctuation), shown in Figure 2.
As that figure indicates, using these words raised the 
accuracy to 69%. Also, although this third list is of 
comparable length to the other two, it has a much 
lower tie rate of 16%. We further observe that some 
of the items in this third list, such as "?" or "still", 
would probably not have been proposed as possible 
candidates merely through introspection, although 
upon reflection one sees their merit (the question 
mark tends to occur in sentences like "What was the 
director thinking?" ; "still" appears in sentences like 
"Still, though, it was worth seeing"). 

We conclude from these preliminary experiments 
that it is worthwhile to explore corpus-based tech- 
niques, rather than relying on prior intuitions, to se- 
lect good indicator features and to perform sentiment 
classification in general. These experiments also pro- 
vide us with baselines for experimental comparison; 
in particular, the third baseline of 69% might actu- 
ally be considered somewhat difficult to beat, since 
it was achieved by examination of the test data (al- 
though our examination was rather cursory; we do 
not claim that our list was the optimal set of four- 
teen words). 



4 Later experiments using these words as features for 
machine learning methods did not yield better results. 
5 This is largely due to 0-0 ties. 



5 Machine Learning Methods 

Our aim in this work was to examine whether it suf- 
fices to treat sentiment classification simply as a spe- 
cial case of topic-based categorization (with the two 
"topics" being positive sentiment and negative sen- 
timent), or whether special sentiment-categorization 
methods need to be developed. We experimented 
with three standard algorithms: Naive Bayes clas- 
sification, maximum entropy classification, and sup- 
port vector machines. The philosophies behind these 
three algorithms are quite different, but each has 
been shown to be effective in previous text catego- 
rization studies. 

To implement these machine learning algorithms
on our document data, we used the following
standard bag-of-features framework. Let $\{f_1, \ldots, f_m\}$ be
a predefined set of $m$ features that can appear in
a document; examples include the word "still" or
the bigram "really stinks". Let $n_i(d)$ be the
number of times $f_i$ occurs in document $d$. Then, each
document $d$ is represented by the document vector
$\vec{d} := (n_1(d), n_2(d), \ldots, n_m(d))$.
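
As a concrete illustration of this representation, the following Python sketch builds the count vector for a tokenized document. The function name and the convention of writing a bigram feature as two words joined by a space are assumptions of the sketch.

```python
from collections import Counter


def document_vector(tokens, features):
    """Map a token list to (n_1(d), ..., n_m(d)) for an ordered feature list.

    Unigram features are plain strings; bigram features are written as two
    words joined by a single space (a convention assumed for this sketch).
    """
    counts = Counter(tokens)
    # Count adjacent word pairs so bigram features can be looked up too.
    counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return [counts[f] for f in features]
```

For example, with features `["still", "really stinks", "good"]`, the tokens of "still , it really stinks . still" map to the vector `[2, 1, 0]`.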

5.1 Naive Bayes 

One approach to text classification is to assign to a
given document $d$ the class $c^* = \arg\max_c P(c \mid d)$.
We derive the Naive Bayes (NB) classifier by first
observing that by Bayes' rule,

$$P(c \mid d) = \frac{P(c)\,P(d \mid c)}{P(d)},$$

where $P(d)$ plays no role in selecting $c^*$. To estimate
the term $P(d \mid c)$, Naive Bayes decomposes it by
assuming the $f_i$'s are conditionally independent given
$d$'s class:

$$P_{NB}(c \mid d) := \frac{P(c)\left(\prod_{i=1}^{m} P(f_i \mid c)^{n_i(d)}\right)}{P(d)}.$$

Our training method consists of relative-frequency
estimation of $P(c)$ and $P(f_i \mid c)$, using add-one
smoothing.
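
The estimation and decision rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the reported results: it assumes the usual multinomial form of add-one smoothing (one pseudo-count per feature per class), works in log space, and uses hypothetical function names.

```python
import math


def train_nb(docs, labels, features):
    """Estimate log P(c) and log P(f_i | c) with add-one smoothing.

    `docs` are count vectors aligned with `features`. The smoothed
    estimate (count + 1) / (total + m) is an assumed, standard form.
    """
    m = len(features)
    log_prior, log_cond = {}, {}
    for c in sorted(set(labels)):
        rows = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(rows) / len(docs))
        totals = [sum(col) for col in zip(*rows)]  # occurrences of f_i in class c
        denom = sum(totals) + m
        log_cond[c] = [math.log((t + 1) / denom) for t in totals]
    return log_prior, log_cond


def classify_nb(d, log_prior, log_cond):
    """Return argmax_c of log P(c) + sum_i n_i(d) * log P(f_i | c)."""
    def score(c):
        return log_prior[c] + sum(n * lp for n, lp in zip(d, log_cond[c]))
    return max(log_prior, key=score)
```

Because $P(d)$ is the same for every class, it is dropped from the score; only the numerator of $P_{NB}(c \mid d)$ is compared.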

Despite its simplicity and the fact that its con- 
ditional independence assumption clearly does not 
hold in real-world situations, Naive Bayes-based text 
categorization still tends to perform surprisingly well 
(Lewis, 1998); indeed, Domingos and Pazzani (1997)
show that Naive Bayes is optimal for certain problem 
classes with highly dependent features. On the other 
hand, more sophisticated algorithms might (and of- 
ten do) yield better results; we examine two such 
algorithms next. 

5.2 Maximum Entropy 

Maximum entropy classification (MaxEnt, or ME,
for short) is an alternative technique which has
proven effective in a number of natural language
processing applications (Berger et al., 1996).
Nigam et al. (1999) show that it sometimes, but not
always, outperforms Naive Bayes at standard text
classification. Its estimate of $P(c \mid d)$ takes the
following exponential form:

$$P_{ME}(c \mid d) := \frac{1}{Z(d)} \exp\left(\sum_i \lambda_{i,c} F_{i,c}(d, c)\right),$$

where $Z(d)$ is a normalization function. $F_{i,c}$ is a
feature/class function for feature $f_i$ and class $c$, defined
as follows:^6

$$F_{i,c}(d, c') := \begin{cases} 1, & n_i(d) > 0 \text{ and } c' = c \\ 0, & \text{otherwise} \end{cases}$$

For instance, a particular feature/class function
might fire if and only if the bigram "still hate"
appears and the document's sentiment is hypothesized
to be negative.^7 Importantly, unlike Naive Bayes,
MaxEnt makes no assumptions about the relationships
between features, and so might potentially perform
better when conditional independence assumptions
are not met.

6 We use a restricted definition of feature/class
functions so that MaxEnt relies on the same sort of feature
information as Naive Bayes.

7 The dependence on class is necessary for parameter
induction. See Nigam et al. (1999) for additional
motivation.

The $\lambda_{i,c}$'s are feature-weight parameters;
inspection of the definition of $P_{ME}$ shows that a large
$\lambda_{i,c}$ means that $f_i$ is considered a strong indicator for
class $c$. The parameter values are set so as to
maximize the entropy of the induced distribution (hence
the classifier's name) subject to the constraint that
the expected values of the feature/class functions
with respect to the model are equal to their expected
values with respect to the training data: the
underlying philosophy is that we should choose the model
making the fewest assumptions about the data while
still remaining consistent with it, which makes
intuitive sense. We use ten iterations of the improved
iterative scaling algorithm (Della Pietra et al., 1997)
for parameter training (this was a sufficient
number of iterations for convergence of training-data
accuracy), together with a Gaussian prior to prevent
overfitting (Chen and Rosenfeld, 2000).
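
To make the exponential form concrete, the sketch below evaluates $P_{ME}(c \mid d)$ for given weights. It covers only the prediction step; parameter training by improved iterative scaling is not shown. The function name and the `lam[c][i]` layout for $\lambda_{i,c}$ are assumptions of this sketch.

```python
import math


def p_me(d, classes, lam):
    """Return {c: P_ME(c | d)} for a count vector `d`.

    `lam[c][i]` holds the weight lambda_{i,c}. Since F_{i,c}(d, c') is 1
    only when feature i is present in d and c' == c, the score of a class
    is the sum of its own weights over the features present in d.
    """
    scores = {
        c: sum(lam[c][i] for i, n in enumerate(d) if n > 0)
        for c in classes
    }
    top = max(scores.values())  # subtract the max for numerical stability
    unnorm = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(unnorm.values())  # equals Z(d) up to the shared factor exp(top)
    return {c: u / z for c, u in unnorm.items()}
```

Note that the count $n_i(d)$ only matters through the test $n_i(d) > 0$, so a feature occurring three times contributes exactly as much as one occurring once.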



5.3 Support Vector Machines 

Support vector machines (SVMs) have been shown to
be highly effective at traditional text categorization,
generally outperforming Naive Bayes (Joachims,
1998). They are large-margin, rather than
probabilistic, classifiers, in contrast to Naive Bayes and
MaxEnt. In the two-category case, the basic idea
behind the training procedure is to find a hyperplane,
represented by vector $\vec{w}$, that not only separates
the document vectors in one class from those in the
other, but for which the separation, or margin, is as
large as possible. This search corresponds to a
constrained optimization problem; letting $c_j \in \{1, -1\}$
(corresponding to positive and negative) be the
correct class of document $d_j$, the solution can be written
as

$$\vec{w} := \sum_j \alpha_j c_j \vec{d}_j, \quad \alpha_j \geq 0,$$

where the $\alpha_j$'s are obtained by solving a dual
optimization problem. Those $\vec{d}_j$ such that $\alpha_j$ is greater
than zero are called support vectors, since they are
the only document vectors contributing to $\vec{w}$.
Classification of test instances consists simply of
determining which side of $\vec{w}$'s hyperplane they fall on.

We used Joachims' (1999) $SVM^{light}$ package^8 for
training and testing, with all parameters set to their
default values, after first length-normalizing the
document vectors, as is standard (neglecting to
normalize generally hurt performance slightly).
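
The relationship between the dual coefficients, the weight vector, and the decision rule can be sketched as below. This does not reproduce the $SVM^{light}$ package or solve the dual problem: the $\alpha_j$ values are taken as given, Euclidean length is assumed for the normalization, and the `bias` offset is an assumption, since the text only mentions $\vec{w}$.

```python
import math


def length_normalize(d):
    """Scale a document vector to unit Euclidean length (assumed norm)."""
    norm = math.sqrt(sum(x * x for x in d))
    return [x / norm for x in d] if norm > 0 else list(d)


def weight_vector(alphas, classes, docs):
    """Compute w = sum_j alpha_j * c_j * d_j from given dual coefficients."""
    w = [0.0] * len(docs[0])
    for a, c, d in zip(alphas, classes, docs):
        if a > 0:  # only support vectors contribute to w
            for i, x in enumerate(d):
                w[i] += a * c * x
    return w


def classify_svm(w, d, bias=0.0):
    """Return +1 or -1 according to the side of the hyperplane d falls on."""
    score = sum(wi * xi for wi, xi in zip(w, d)) + bias
    return 1 if score >= 0 else -1
```

Documents with $\alpha_j = 0$ can be dropped after training without changing $\vec{w}$, which is why only the support vectors matter at test time.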

6 Evaluation 

6.1 Experimental Set-up 

We used documents from the movie-review corpus
described in Section 3. To create a data set with
uniform class distribution (studying the effect of skewed
class distributions was out of the scope of this study), 
we randomly selected 700 positive-sentiment and 700 
negative-sentiment documents. We then divided this 
data into three equal-sized folds, maintaining bal- 
anced class distributions in each fold. (We did not 

8 http://svmlight.joachims.org





     Features            # of       frequency or   NB      ME      SVM
                         features   presence?
(1)  unigrams            16165      freq.          78.7*   N/A     72.8
(2)  unigrams            16165      pres.          81.0    80.4    82.9*
(3)  unigrams+bigrams    32330      pres.          80.6    80.8    82.7*
(4)  bigrams             16165      pres.          77.3    77.4*   77.1
(5)  unigrams+POS        16695      pres.          81.5    80.4    81.9*
(6)  adjectives          2633       pres.          77.0    77.7*   75.1
(7)  top 2633 unigrams   2633       pres.          80.3    81.0    81.4*
(8)  unigrams+position   22430      pres.          81.0    80.1    81.6*

Figure 3: Average three-fold cross-validation accuracies, in percent. An asterisk marks the best performance for a given
setting (row). Recall that our baseline results ranged from 50% to 69%.



use a larger number of folds due to the slowness of 
the MaxEnt training procedure.) All results reported 
below, as well as the baseline results from Section 4,
are the average three-fold cross-validation results on 
this data (of course, the baseline algorithms had no 
parameters to tune). 
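
One way to produce class-balanced folds and the corresponding cross-validation splits is sketched below. The shuffle seed, the round-robin assignment, and the function names are assumptions; the text does not say how documents were assigned to folds.

```python
import random


def stratified_folds(docs_by_class, k=3, seed=0):
    """Split each class as evenly as possible across k folds.

    `docs_by_class` maps a label to its list of documents. Shuffling with
    a fixed seed and dealing documents out round-robin are assumptions.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label, docs in docs_by_class.items():
        shuffled = docs[:]
        rng.shuffle(shuffled)
        for i, doc in enumerate(shuffled):
            folds[i % k].append((doc, label))
    return folds


def cross_validation_splits(folds):
    """Yield (train, test) pairs, holding out each fold exactly once."""
    for i, test in enumerate(folds):
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, test
```

Since 700 is not divisible by three, the folds in this sketch can only be approximately equal in size (each class contributes 234, 233, and 233 documents). The reported accuracy would then be the mean over the three held-out folds.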

To prepare the documents, we automatically re- 
moved the rating indicators and extracted the tex- 
tual information from the original HTML docu- 
ment format, treating punctuation as separate lex- 
ical items. No stemming or stoplists were used. 

One unconventional step we took was to attempt 
to model the potentially important contextual effect 
of negation: clearly "good" and "not very good" in- 
dicate opposite sentiment orientations. Adapting a 
technique of Das and Chen (2001), we added the tag 
NOT_ to every word between a negation word ("not",
"isn't", "didn't", etc.) and the first punctuation 
mark following the negation word. (Preliminary ex- 
periments indicate that removing the negation tag 
had a negligible, but on average slightly harmful, ef- 
fect on performance.) 
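
The negation-tagging step can be sketched in Python as follows. The negation-word set lists only the three examples given in the text (the full list is not specified), and the punctuation set is likewise an assumption; punctuation is expected to arrive as separate tokens, as described above.

```python
NEGATION_WORDS = {"not", "isn't", "didn't"}  # illustrative subset only
PUNCTUATION = {".", ",", ";", ":", "!", "?"}  # assumed set of scope-ending marks


def tag_negation(tokens):
    """Prefix NOT_ to each token after a negation word, up to punctuation."""
    tagged, negating = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            negating = False  # punctuation closes the negation scope
            tagged.append(tok)
        elif negating:
            tagged.append("NOT_" + tok)
        else:
            tagged.append(tok)
            if tok.lower() in NEGATION_WORDS:
                negating = True
    return tagged
```

Applied to the tokens of "this is not very good , but fun", the sketch yields `this is not NOT_very NOT_good , but fun`, so "good" and "NOT_good" become distinct unigram features.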

For this study, we focused on features based on 
unigrams (with negation tagging) and bigrams. Be- 
cause training MaxEnt is expensive in the number of 
features, we limited consideration to (1) the 16165 
unigrams appearing at least four times in our 1400- 
document corpus (lower count cutoffs did not yield 
significantly different results), and (2) the 16165 bi- 
grams occurring most often in the same data (the 
selected bigrams all occurred at least seven times). 
Note that we did not add negation tags to the
bigrams, since we consider bigrams (and n-grams in
general) to be an orthogonal way to incorporate con- 
text. 
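
The two selection rules can be sketched as below. The sketch assumes that "appearing at least four times" refers to total occurrences across the corpus rather than to the number of documents containing the word; the function name and defaults are illustrative.

```python
from collections import Counter


def select_features(token_lists, min_unigram_count=4, num_bigrams=16165):
    """Return (unigrams, bigrams) chosen by corpus frequency.

    Unigrams must occur at least `min_unigram_count` times in total
    (an assumed reading); bigrams are the `num_bigrams` most frequent.
    """
    unigram_counts, bigram_counts = Counter(), Counter()
    for tokens in token_lists:
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))
    unigrams = [w for w, n in unigram_counts.items() if n >= min_unigram_count]
    bigrams = [b for b, _ in bigram_counts.most_common(num_bigrams)]
    return unigrams, bigrams
```

Fixing the number of bigrams to match the number of unigrams keeps the two feature sets the same size, which makes lines (2) and (4) of the results table directly comparable.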



6.2 Results 

Initial unigram results The classification
accuracies resulting from using only unigrams as
features are shown in line (1) of Figure 3. As a whole,
the machine learning algorithms clearly surpass the 
random-choice baseline of 50%. They also hand- 
ily beat our two human-selected-unigram baselines 
of 58% and 64%, and, furthermore, perform well in 
comparison to the 69% baseline achieved via limited 
access to the test-data statistics, although the im- 
provement in the case of SVMs is not so large. 

On the other hand, in topic-based classification,
all three classifiers have been reported to use
bag-of-unigram features to achieve accuracies of 90%
and above for particular categories (Joachims, 1998;
Nigam et al., 1999),^9 and such results are for
settings with more than two classes. This provides
suggestive evidence that sentiment categorization is
more difficult than topic classification, which
corresponds to the intuitions of the text categorization
expert mentioned above.^10 Nonetheless, we still
wanted to investigate ways to improve our sentiment
categorization results; these experiments are
reported below.

9 Joachims (1998) used stemming and stoplists; in
some of their experiments, Nigam et al. (1999), like us,
did not.

10 We could not perform the natural experiment of
attempting topic-based categorization on our data because
the only obvious topics would be the film being reviewed;
unfortunately, in our data, the maximum number of
reviews per movie is 27, too small for meaningful results.

Feature frequency vs. presence Recall that we
represent each document $d$ by a feature-count vector
$(n_1(d), \ldots, n_m(d))$. However, the definition of the
MaxEnt feature/class functions $F_{i,c}$ only reflects the
presence or absence of a feature, rather than directly
incorporating feature frequency. In order to investigate
whether reliance on frequency information could
account for the higher accuracies of Naive Bayes and
SVMs, we binarized the document vectors, setting
$n_i(d)$ to 1 if and only if feature $f_i$ appears in $d$, and
reran Naive Bayes and $SVM^{light}$ on these new
vectors.^11
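
The binarization step is simple enough to state in one line of Python; the function name is hypothetical.

```python
def binarize(d):
    """Set n_i(d) to 1 if and only if feature f_i appears in d, else 0."""
    return [1 if n > 0 else 0 for n in d]
```

After this transformation, the same vectors can be passed unchanged to classifiers that were written for counts; only the information about repetition is lost.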

As can be seen from line (2) of Figure 3,
better performance (much better performance for
SVMs) is achieved by accounting only for feature
presence, not feature frequency. Interestingly,
this is in direct opposition to the observations of
McCallum and Nigam (1998) with respect to Naive
Bayes topic classification. We speculate that this
indicates a difference between sentiment and topic
categorization (perhaps due to topic being conveyed
mostly by particular content words that tend to be
repeated), but this remains to be verified. In any
event, as a result of this finding, we did not
incorporate frequency information into Naive Bayes and
SVMs in any of the following experiments.

11 Alternatively, we could have tried integrating
frequency information into MaxEnt. However, feature/class
functions are traditionally defined as binary (Berger et
al., 1996); hence, explicitly incorporating frequencies
would require different functions for each count (or count
bin), making training impractical. But cf. (Nigam et al.,
1999).

Bigrams In addition to looking specifically for
negation words in the context of a word, we also
studied the use of bigrams to capture more context
in general. Note that bigrams and unigrams are
surely not conditionally independent, meaning that
the feature set they comprise violates Naive Bayes'
conditional-independence assumptions; on the other
hand, recall that this does not imply that Naive
Bayes will necessarily do poorly (Domingos and
Pazzani, 1997).

Line (3) of the results table shows that bigram
information does not improve performance beyond
that of unigram presence, although adding in the
bigrams does not seriously impact the results, even for
Naive Bayes. This would not rule out the possibility
that bigram presence is as useful a feature
as unigram presence; in fact, Pedersen (2001) found
that bigrams alone can be effective features for word
sense disambiguation. However, comparing line (4)
to line (2) shows that relying just on bigrams causes
accuracy to decline by as much as 5.8 percentage
points. Hence, if context is in fact important, as our
intuitions suggest, bigrams are not effective at
capturing it in our setting.

Parts of speech We also experimented with
appending POS tags to every word via Oliver Mason's
Qtag program.^12 This serves as a crude form of word
sense disambiguation (Wilks and Stevenson, 1998):
for example, it would distinguish the different usages
of "love" in "I love this movie" (indicating sentiment
orientation) versus "This is a love story" (neutral
with respect to sentiment). However, the effect of
this information seems to be a wash: as depicted in
line (5) of Figure 3, the accuracy improves slightly
for Naive Bayes but declines for SVMs, and the
performance of MaxEnt is unchanged.

Since adjectives have been a focus of previous work
in sentiment detection (Hatzivassiloglou and Wiebe,
2000; Turney, 2002),^13 we looked at the performance
of using adjectives alone. Intuitively, we might
expect that adjectives carry a great deal of
information regarding a document's sentiment; indeed, the
human-produced lists from Section 4 contain almost
no other parts of speech. Yet, the results, shown in
line (6) of Figure 3, are relatively poor: the 2633
adjectives provide less useful information than
unigram presence. Indeed, line (7) shows that simply
using the 2633 most frequent unigrams is a better
choice, yielding performance comparable to that of
using (the presence of) all 16165 (line (2)). This may
imply that applying explicit feature-selection
algorithms on unigrams could improve performance.
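
The two part-of-speech feature sets discussed above can be sketched as follows, assuming the tagger's output is already available as (word, tag) pairs. The sketch does not call the Qtag program; the Penn-Treebank-style tag names (`VB`, `NN`, `JJ`, ...) and the underscore-joined feature format are assumptions for illustration.

```python
def append_pos(tagged_tokens):
    """Turn (word, tag) pairs into combined features such as 'love_VB'."""
    return [f"{word}_{tag}" for word, tag in tagged_tokens]


def adjectives_only(tagged_tokens, adjective_tags=("JJ", "JJR", "JJS")):
    """Keep only the words whose tag marks them as adjectives.

    The default tag names are hypothetical; a real tagger's tagset
    would have to be substituted.
    """
    return [word for word, tag in tagged_tokens if tag in adjective_tags]
```

With combined features, "love" as a verb and "love" as a noun become two different unigrams, which is the crude sense distinction the text describes.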

Position An additional intuition we had was that 
the position of a word in the text might make a dif- 
ference: movie reviews, in particular, might begin 
with an overall sentiment statement, proceed with 
a plot discussion, and conclude by summarizing the 
author's views. As a rough approximation to deter- 
mining this kind of structure, we tagged each word 
according to whether it appeared in the first
quarter, last quarter, or middle half of the document.^14
The results (line (8)) didn't differ greatly from using 
unigrams alone, but more refined notions of position 
might be more successful. 
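
A rough version of this positional tagging is sketched below. It assumes that quarters are measured in tokens and that the position is attached as a suffix; neither detail is given in the text.

```python
def tag_position(tokens):
    """Suffix each token with the part of the document it occurs in.

    Quarters are measured by token index (an assumption of this sketch).
    """
    n = len(tokens)
    tagged = []
    for i, tok in enumerate(tokens):
        if i < n / 4:
            part = "first"
        elif i >= 3 * n / 4:
            part = "last"
        else:
            part = "middle"
        tagged.append(f"{tok}_{part}")
    return tagged
```

Each word can then yield up to three distinct features, which is consistent with the larger feature count for the unigrams+position setting.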

7 Discussion 

The results produced via machine learning
techniques are quite good in comparison to the
human-generated baselines discussed in Section 4. In terms
of relative performance, Naive Bayes tends to do the 
worst and SVMs tend to do the best, although the 
differences aren't very large. 



12 http://www.english.bham.ac.uk/staff/oliver/software/tagger/index.htm



13 Turney's (2002) unsupervised algorithm uses
bigrams containing an adjective or an adverb.

14 We tried a few other settings, e.g., first third vs. last
third vs. middle third, and found them to be less effective.



On the other hand, we were not able to achieve ac- 
curacies on the sentiment classification problem com- 
parable to those reported for standard topic-based 
categorization, despite the several different types of 
features we tried. Unigram presence information 
turned out to be the most effective; in fact, none of 
the alternative features we employed provided consis- 
tently better performance once unigram presence was 
incorporated. Interestingly, though, the superiority
of presence information in comparison to frequency
information in our setting contradicts previous
observations made in topic-classification work (McCallum
and Nigam, 1998).

What accounts for these two differences (difficulty
and types of information proving useful)
between topic and sentiment classification, and how
might we improve the latter? To answer these ques- 
tions, we examined the data further. (All examples 
below are drawn from the full 2053-document cor- 
pus.) 

As it turns out, a common phenomenon in the doc- 
uments was a kind of "thwarted expectations" narra- 
tive, where the author sets up a deliberate contrast 
to earlier discussion: for example, "This film should 
be brilliant. It sounds like a great plot, the actors are 
first grade, and the supporting cast is good as well, and 
Stallone is attempting to deliver a good performance. 
However, it can't hold up" or "I hate the Spice Girls. 
...[3 things the author hates about them]... Why I saw 
this movie is a really, really, really long story, but I 
did, and one would think I'd despise every minute of 
it. But... Okay, I'm really ashamed of it, but I enjoyed 
it. I mean, I admit it's a really awful movie ...the ninth 
floor of hell. ..The plot is such a mess that it's terrible. 
But I loved it." 

In these examples, a human would easily detect
the true sentiment of the review, but bag-of-features
classifiers would presumably find these instances
difficult, since there are many words indicative of the
opposite sentiment to that of the entire review.
Fundamentally, it seems that some form of discourse
analysis is necessary (using more sophisticated
techniques than our positional feature mentioned above),
or at least some way of determining the focus of each
sentence, so that one can decide when the author is
talking about the film itself. (Turney (2002) makes
a similar point, noting that for reviews, "the whole
is not necessarily the sum of the parts".) Furthermore,
it seems likely that this thwarted-expectations
rhetorical device will appear in many types of texts
(e.g., editorials) devoted to expressing an overall
opinion about some topic. Hence, we believe that an
important next step is the identification of features
indicating whether sentences are on-topic (which is
a kind of co-reference problem); we look forward to
addressing this challenge in future work.

15 This phenomenon is related to another common
theme, that of "a good actor trapped in a bad movie":
"AN AMERICAN WEREWOLF IN PARIS is a failed
attempt... Julie Delpy is far too good for this movie. She
imbues Serafine with spirit, spunk, and humanity. This isn't
necessarily a good thing, since it prevents us from
relaxing and enjoying AN AMERICAN WEREWOLF IN PARIS
as a completely mindless, campy entertainment experience.
Delpy's injection of class into an otherwise classless
production raises the specter of what this film could have been
with a better script and a better cast ... She was radiant,
charismatic, and effective